STA6235: Modeling in Regression
Let us now review the “checks” we will perform on our models.
Model assumptions
Outliers
Influence
Leverage
Multicollinearity
Recall the general linear model, y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, where \varepsilon is the residual error term.
Recall that the residual error is defined as \varepsilon = y - \hat{y} where
y is the observed value
\hat{y} is the predicted value
The residual tells us how far away the observation is from the response surface (the predicted value).
Ordinary least squares estimation means that we have minimized the overall error; specifically, we choose the coefficient estimates that minimize the sum of squared residuals, \sum_{i=1}^{n} e_i^2.
\varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)
The assumptions are on the residual!
What this means:
Residuals are normally distributed
Distribution of residuals is centered at 0
Variance of residuals is some constant \sigma^2
We will assess the assumptions graphically.
Constant variance: scatterplot
Normal distribution: q-q plot
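As a sketch of these two diagnostic plots in base R (mtcars stands in here for the course dataset; the model formula is an illustration only):

```r
# Fit an example model (mtcars is a built-in stand-in for course data)
m <- lm(mpg ~ wt + hp, data = mtcars)

# Constant variance: residuals vs. fitted values should show no pattern
plot(fitted(m), resid(m),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Normality: points on the q-q plot should follow the reference line
qqnorm(resid(m))
qqline(resid(m))
```

If the residuals fan out (variance changing with the fitted value) or the q-q points bow away from the line, the corresponding assumption is suspect.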
A package was written by a former student, classpackage.
If you are working on the server, the package is already installed.
If you are not working on the server, you need to install the package first.
We will use its anova_check() function to assess the model assumptions.
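A minimal install-and-use sketch; the GitHub repository name below is an assumption, so check the course materials for the exact install instructions:

```r
# Install once (not needed on the server). The repository location is an
# assumption -- confirm it on the course site before running.
# install.packages("devtools")
# devtools::install_github("ieb2/classpackage")

library(classpackage)
m <- lm(mpg ~ wt + hp, data = mtcars)  # mtcars stands in for course data
anova_check(m)  # diagnostic plots for the model assumptions
```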
Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)
Residuals:
Min 1Q Median 3Q Max
-1.43672 -0.29274 -0.03135 0.26734 1.62060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.605835 0.102452 83.999 < 2e-16 ***
season -0.093289 0.015109 -6.174 4.11e-09 ***
episode 0.013616 0.005132 2.653 0.00868 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared: 0.1835, Adjusted R-squared: 0.1747
F-statistic: 20.79 on 2 and 185 DF, p-value: 7.147e-09
If we want to know how well the model fits the particular dataset at hand, we can look at the R^2.
R^2 is the proportion of variance explained by the model.
Because it is a proportion, R^2 \in [0, 1] and is independent of the units of y.
If R^2 \to 0, the model does not fit the data well; if R^2 \to 1, the model fits the data well.
R^2 = \frac{\text{SSReg}}{\text{SSTot}}
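The formula above can be verified by hand against lm()'s reported value (a sketch using the built-in mtcars data as a stand-in):

```r
m <- lm(mpg ~ wt + hp, data = mtcars)  # mtcars stands in for course data

y     <- mtcars$mpg
sstot <- sum((y - mean(y))^2)   # total variability in y
sse   <- sum(resid(m)^2)        # variability left to outside sources
ssreg <- sstot - sse            # variability explained by the model

r2 <- ssreg / sstot
all.equal(r2, summary(m)$r.squared)  # agrees with lm's Multiple R-squared
```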
Remember that we are partitioning the variability in y (SSTot), which is constant, into two pieces:
The variability explained by the regression model (SSReg).
The variability left unexplained by the model, attributed to outside sources (SSE).
As predictors are added to the model, we necessarily increase SSReg and decrease SSE, so R^2 never decreases — even when the added predictors are useless.
We want a measure of model fit that is resistant to this fluctuation, R^2_{\text{adj}} = 1 - \left( \frac{n-1}{n-k-1} \right) \left( 1 - R^2 \right), where k is the number of predictor terms in the model.
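The adjusted R^2 formula can likewise be checked by hand (again using mtcars as a stand-in, with k = 2 predictor terms):

```r
m <- lm(mpg ~ wt + hp, data = mtcars)  # mtcars stands in for course data
n <- nrow(mtcars)  # number of observations
k <- 2             # number of predictor terms (wt and hp)

r2     <- summary(m)$r.squared
r2_adj <- 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

all.equal(r2_adj, summary(m)$adj.r.squared)  # agrees with lm's Adjusted R-squared
```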
Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)
Residuals:
Min 1Q Median 3Q Max
-1.43672 -0.29274 -0.03135 0.26734 1.62060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.605835 0.102452 83.999 < 2e-16 ***
season -0.093289 0.015109 -6.174 4.11e-09 ***
episode 0.013616 0.005132 2.653 0.00868 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared: 0.1835, Adjusted R-squared: 0.1747
F-statistic: 20.79 on 2 and 185 DF, p-value: 7.147e-09
Outlier: a data value that is much larger or smaller than the rest of the values in the dataset.
We will look at the standardized residuals, e_{i_{\text{standardized}}} = \frac{e_i}{\sqrt{\text{MSE}(1-h_i)}}, where e_i is the i-th raw residual, MSE is the mean squared error of the model, and h_i is the i-th leverage value (the i-th diagonal element of the hat matrix).
If |e_{i_{\text{standardized}}}| > 2.5 \ \to \ outlier.
If |e_{i_{\text{standardized}}}| > 3 \ \to \ extreme outlier.
We will use the rstandard() function to find the standardized residuals.
For ease of examining in large datasets, we will use it to create a “flag.”
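A base-R sketch of the flagging idea (the course code builds the same flag inside the data frame with dplyr's mutate(); mtcars stands in for the course data):

```r
m <- lm(mpg ~ wt + hp, data = mtcars)  # mtcars stands in for course data

std_resid <- rstandard(m)        # internally standardized residuals
outlier <- abs(std_resid) > 2.5  # TRUE = flagged as outlier
extreme <- abs(std_resid) > 3    # TRUE = flagged as extreme outlier

which(outlier)  # which observations were flagged
```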
A leverage point is an observation with an unusual predictor (x) value; its leverage h_i is large relative to the other observations.
An influential point is an observation whose removal meaningfully changes the fitted regression model.
We check these together using Cook’s distance.
We use the gg_cooksd() function from the lindia package to construct the graph.
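A sketch of the check; the lindia call is commented out in case the package is not installed, and the same quantity is available from base R's cooks.distance() (mtcars stands in for course data):

```r
m <- lm(mpg ~ wt + hp, data = mtcars)  # mtcars stands in for course data

# With lindia (and ggplot2) installed, as used in the course:
# library(lindia)
# gg_cooksd(m)

# The same quantity in base R:
cd <- cooks.distance(m)  # one Cook's distance per observation
plot(cd, type = "h", ylab = "Cook's distance")
```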
gg_cooksd() returns a ggplot() object, so we can make modifications like we normally do in ggplot().
Collinearity/multicollinearity: correlation between two or more predictor variables affects the estimation procedure.
We will use the variance inflation factor (VIF) to check for multicollinearity. \text{VIF}_j = \frac{1}{1-R^2_j},
where R^2_j is the R^2 obtained by regressing the j-th predictor on the remaining predictors.
We say that multicollinearity is present if VIF > 10.
How do we deal with multicollinearity?
Easy answer: remove at least one predictor from the collinear set, then reassess VIF.
More complicated answer: discuss with collaborators/bosses which predictors are most important to retain before dropping any.
We will use the vif() function from the car package.
Note: the car package masks some functions from the tidyverse, so I typically do not load the full library; instead, I call car::vif() directly.
There will be a VIF value for each predictor in the model.
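A sketch of the check; the car call is commented out in case the package is not installed, and one VIF is computed directly from its definition (mtcars stands in for course data):

```r
m <- lm(mpg ~ wt + hp, data = mtcars)  # mtcars stands in for course data

# With the car package installed (:: avoids masking tidyverse functions):
# car::vif(m)

# VIF for wt from the definition: regress wt on the other predictor(s)
r2_wt  <- summary(lm(wt ~ hp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt  # a value > 10 would signal problematic multicollinearity
```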
We can perform sensitivity analysis to determine how much our model changes when we exclude the outliers.
Questions we will ask: Do the coefficient estimates change meaningfully? Do our conclusions change?
We only look at sensitivity analysis once (i.e., only remove data points once for reanalysis).
Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)
Residuals:
Min 1Q Median 3Q Max
-1.43672 -0.29274 -0.03135 0.26734 1.62060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.605835 0.102452 83.999 < 2e-16 ***
season -0.093289 0.015109 -6.174 4.11e-09 ***
episode 0.013616 0.005132 2.653 0.00868 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared: 0.1835, Adjusted R-squared: 0.1747
F-statistic: 20.79 on 2 and 185 DF, p-value: 7.147e-09
office_ratings_no_outliers <- office_ratings %>% filter(outlier == FALSE)
m2 <- lm(imdb_rating ~ season + episode, data = office_ratings_no_outliers)
summary(m2)
Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings_no_outliers)
Residuals:
Min 1Q Median 3Q Max
-0.99158 -0.28643 -0.03396 0.27160 1.19157
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.665452 0.090148 96.125 < 2e-16 ***
season -0.099590 0.013239 -7.523 2.51e-12 ***
episode 0.010129 0.004515 2.243 0.0261 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4228 on 179 degrees of freedom
Multiple R-squared: 0.2474, Adjusted R-squared: 0.239
F-statistic: 29.43 on 2 and 179 DF, p-value: 8.923e-12
Today we have covered
model assumptions
model diagnostics
sensitivity analysis
Today’s activity: